Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CIAINFRA-282: Cluster V1 CR: add observedGeneration and OperatorQuiescent #201

Merged
merged 2 commits into from
Sep 2, 2024

Conversation

birdayz
Copy link
Contributor

@birdayz birdayz commented Aug 26, 2024

It is currently challenging to determine if a resource has been reconciled, and it is "ready":

  • No observedGeneration is exported in status. This makes it hard to determine, if the controller has already reconciled the cluster. Having observedGeneration in status, clients can just store the returned metadata.generation from the Update call, and wait until status.observedGeneration >= that value.
  • No field represents the controller having "done its work". The ClusterQuiescent condition represents this. If no outstanding work is known (business logic), and no errors are thrown, True is reported. Otherwise, false is reported.

@birdayz birdayz force-pushed the jb/observedGeneration-cluster-quiescent branch 4 times, most recently from 225ac04 to 7283cbd Compare August 26, 2024 16:24
@birdayz birdayz force-pushed the jb/observedGeneration-cluster-quiescent branch 2 times, most recently from 02266eb to 3c150f3 Compare August 26, 2024 16:26
@birdayz birdayz changed the title Cluster V1 CR: add observedGeneration and ClusterQuiescent Cluster V1 CR: add observedGeneration and OperatorQuiescent Aug 26, 2024
@birdayz birdayz force-pushed the jb/observedGeneration-cluster-quiescent branch 11 times, most recently from bc59e08 to 59d1514 Compare August 27, 2024 08:48
@@ -5,6 +5,8 @@ metadata:
status:
restarting: false
conditions:
- type: OperatorQuiescent
status: "False"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this case is false?

Copy link
Contributor Author

@birdayz birdayz Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upgrade is in progress (spec.version != status.version)

@birdayz birdayz force-pushed the jb/observedGeneration-cluster-quiescent branch from 59d1514 to 6fcced5 Compare August 27, 2024 08:53
Copy link
Contributor

@RafalKorepta RafalKorepta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please let's wait for @chrisseto to have a chance to review.

@birdayz birdayz force-pushed the jb/observedGeneration-cluster-quiescent branch 4 times, most recently from b864472 to 26f0806 Compare August 27, 2024 11:03
@birdayz birdayz changed the title Cluster V1 CR: add observedGeneration and OperatorQuiescent CIAINFRA-282: Cluster V1 CR: add observedGeneration and OperatorQuiescent Aug 27, 2024
@birdayz birdayz force-pushed the jb/observedGeneration-cluster-quiescent branch from 26f0806 to c7186a1 Compare August 27, 2024 12:17
@birdayz
Copy link
Contributor Author

birdayz commented Aug 27, 2024

Will switch tomorrow to a different strategy to fix the tests.
Adding the new condition to their kuttl assertions changes test behavior, i do not want that.
Not adding it causes errors, because kuttl can't assert only parts of an array.. kudobuilder/kuttl#76 (comment)

Instead, i will remove the conditions from the assertion yamls, and add these:

apiVersion: kuttl.dev/v1beta1
kind: TestAssert
commands:
- timeout: 300
  script: |
    kubectl wait --for=condition=ClusterConfigured=True cluster/decommission --timeout 300s -n redpanda --namespace $NAMESPACE

@birdayz birdayz force-pushed the jb/observedGeneration-cluster-quiescent branch 7 times, most recently from 8847960 to 0a499e4 Compare August 28, 2024 08:37
@birdayz
Copy link
Contributor Author

birdayz commented Aug 28, 2024

Got a full green run locally, but CI so far flakes.

--- PASS: kuttl (4090.75s)
    --- PASS: kuttl/harness (0.00s)
        --- PASS: kuttl/harness/additional-cmdline-arguments (43.13s)
        --- PASS: kuttl/harness/lost-redpanda-decommission (114.26s)
        --- PASS: kuttl/harness/console (111.48s)
        --- PASS: kuttl/harness/external-connectivity (39.85s)
        --- PASS: kuttl/harness/kafkaapi-client-auth (93.27s)
        --- PASS: kuttl/harness/explicit-ports-tls (47.81s)
        --- PASS: kuttl/harness/endpoint-template (145.48s)
        --- PASS: kuttl/harness/create-topic-given-issuer-with-client-auth (138.23s)
        --- PASS: kuttl/harness/console-finalizers (70.90s)
        --- PASS: kuttl/harness/decommission (323.23s)
        --- PASS: kuttl/harness/create-topic-given-issuer (99.88s)
        --- PASS: kuttl/harness/console-admin-api (115.20s)
        --- PASS: kuttl/harness/centralized-configuration-bootstrap (103.68s)
        --- PASS: kuttl/harness/console-kafka-mtls (128.48s)
        --- PASS: kuttl/harness/confluent-schema-registry (75.01s)
        --- PASS: kuttl/harness/centralized-configuration-drift (101.42s)
        --- PASS: kuttl/harness/centralized-configuration-upgrade (177.08s)
        --- PASS: kuttl/harness/centralized-configuration-tls (366.68s)
        --- PASS: kuttl/harness/admin-api-tls (57.44s)
        --- PASS: kuttl/harness/admin-api-tls-ext (55.85s)
        --- PASS: kuttl/harness/resources-cluster (26.01s)
        --- PASS: kuttl/harness/centralized-configuration (186.08s)
        --- PASS: kuttl/harness/pandaproxy-produce-consume-tls-client-external-ca (157.37s)
        --- PASS: kuttl/harness/regressions (250.52s)
        --- PASS: kuttl/harness/redpanda-schema-registry-sasl (268.48s)
        --- PASS: kuttl/harness/redpanda-schema-registry (138.26s)
        --- PASS: kuttl/harness/user-specified-serviceaccount-name (10.28s)
        --- PASS: kuttl/harness/produce-tls (65.76s)
        --- PASS: kuttl/harness/pandaproxy-produce-consume (50.24s)
        --- PASS: kuttl/harness/pandaproxy-produce-consume-tls-client (58.23s)
        --- PASS: kuttl/harness/pandaproxy-produce-consume-sasl (53.65s)
        --- PASS: kuttl/harness/update-image-tls-no-client-auth (751.99s)
        --- PASS: kuttl/harness/update-image-and-node-port (882.82s)
        --- PASS: kuttl/harness/update-image-tls-client-auth-external-ca (192.69s)
        --- PASS: kuttl/harness/node-select-tolerations (14.09s)
        --- PASS: kuttl/harness/update-image-tls-client-auth-no-external-ca (661.78s)
        --- PASS: kuttl/harness/admin-api (62.34s)
        --- PASS: kuttl/harness/shared-tls-cert (64.40s)
        --- PASS: kuttl/harness/update (581.67s)
        --- PASS: kuttl/harness/superusers-prefix (75.72s)
        --- PASS: kuttl/harness/additional-configuration (77.28s)
        --- PASS: kuttl/harness/schema-registry-tls-client-external-ca (288.12s)
        --- PASS: kuttl/harness/managed-decommission (323.95s)
PASS

@birdayz birdayz force-pushed the jb/observedGeneration-cluster-quiescent branch 9 times, most recently from 1911401 to 0fa3045 Compare August 30, 2024 12:51
Name: observedCluster.Name,
}
cluster := &vectorizedv1alpha1.Cluster{}
err := c.Get(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performing a patch request, I don't think you'd need to bother refreshing the cluster here. You just trying to push the changes performed by mutator up to the server and Patch doesn't return errors on ResourceVersion mismatches.

clusterPatch := client.MergeFrom(cluster.DeepCopy())
mutator(cluster)

err = c.Status().Patch(ctx, cluster, clusterPatch)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: prefer inline error handling to avoid polluting the name err when possible. It also helps readers understand that err is only considered within the scope of the if and isn't intended to live further.

@chrisseto
Copy link
Contributor

Sorry for the delay! Those kuttl assertion were a pain. Thanks for fixing/updating all of them!

It is currently challenging to determine if a resource has been
reconciled, and it is "ready":

- No observedGeneration is exported in status. This makes it hard to
  determine, if the controller has already reconciled the cluster.
  Having observedGeneration in status, clients can just store the
  returned metadata.generation from the Update call, and wait until
  status.observedGeneration >= that value.
- No field represents the controller having "done its work". The
  ClusterQuiescent condition represents this. If no outstanding work is
  known (business logic), and no errors are thrown, True is reported.
  Otherwise, false is reported.
@RafalKorepta RafalKorepta force-pushed the jb/observedGeneration-cluster-quiescent branch from 0fa3045 to 560992b Compare September 2, 2024 08:34
@RafalKorepta RafalKorepta enabled auto-merge (rebase) September 2, 2024 08:35
@RafalKorepta RafalKorepta merged commit 32fb46c into main Sep 2, 2024
2 of 4 checks passed
RafalKorepta added a commit that referenced this pull request Sep 12, 2024
The strategic merge patch should help in integration tests flakiness as many
nighlty tests are failing with the following log:
```
Operation cannot be fulfilled on clusters.redpanda.vectorized.io \"central-removal\": the object has been modified; please apply your changes to the latest version and try again
```

That problem could happen due to addition of `Quiescent` condition in
#201
RafalKorepta added a commit that referenced this pull request Sep 12, 2024
The strategic merge patch should help in integration tests flakiness as many
nighlty tests are failing with the following log:
```
Operation cannot be fulfilled on clusters.redpanda.vectorized.io \"central-removal\": the object has been modified; please apply your changes to the latest version and try again
```

That problem could happen due to addition of `Quiescent` condition in
#201
RafalKorepta added a commit that referenced this pull request Sep 12, 2024
The strategic merge patch should help in integration tests flakiness as many
nighlty tests are failing with the following log:
```
Operation cannot be fulfilled on clusters.redpanda.vectorized.io \"central-removal\": the object has been modified; please apply your changes to the latest version and try again
```

That problem could happen due to addition of `Quiescent` condition in
#201
@RafalKorepta RafalKorepta deleted the jb/observedGeneration-cluster-quiescent branch December 2, 2024 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants